chore[skiplog]: Qwen3.5 perf benchmark suite (reasoning-budget, ppTPS, desktop + mobile) #2400
Open
donriddo wants to merge 41 commits into
Contributor
Just for mobile, can you run the bench on both CPU and GPU?
Contributor
Author
Already does.
…-llm-suite # Conflicts: # .github/workflows/integration-mobile-test-llm-llamacpp.yml
A comparison requested via compare_run_id renders delta columns against a baseline run. When the baseline produced no benchmark rows (e.g. only its run-meta/desktop-meta metadata artifacts were downloaded), the comparison was silently empty: the report rendered with no deltas and the job went green even though the requested comparison was never produced. render-report.js now exits non-zero when compareDir is set but the baseline has zero rows. This is distinct from a baseline that has rows but none matching the current devices, which still renders a per-device note.
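The distinction drawn above (an empty baseline fails, a baseline with no matching devices only gets a note) can be sketched as a small decision function. This is a minimal illustration, not the code in `render-report.js`: the function name, the `status` values and the row shape (`{ device }`) are assumptions made for the example.

```javascript
// Hypothetical helper: name, status values and row shape are illustrative only.
function checkBaseline ({ compareDir, baselineRows, currentDevices }) {
  // No comparison requested: nothing to validate.
  if (!compareDir) return { status: 'no-comparison' }

  // Comparison requested but the baseline has zero benchmark rows:
  // the caller is expected to exit non-zero instead of rendering an empty comparison.
  if (baselineRows.length === 0) {
    return { status: 'fail', message: `baseline in ${compareDir} has no benchmark rows` }
  }

  // Baseline has rows, but none for the devices in the current run:
  // still render, with a per-device note.
  const devices = new Set(currentDevices)
  const matching = baselineRows.filter(row => devices.has(row.device))
  if (matching.length === 0) {
    return { status: 'note', message: 'no baseline rows match the current devices' }
  }

  return { status: 'ok', rows: matching }
}
```

A caller would presumably map `status === 'fail'` to `process.exitCode = 1` after printing the message, so the job goes red rather than green.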
The grid is 2 x 5 x 7 after the TurboQuant/PolarQuant expansion, not 2 x 5 x 3.
maxim-smotrov
previously approved these changes
Jun 10, 2026
jesusmb1995
previously approved these changes
Jun 11, 2026
jesusmb1995
left a comment
Contributor
The only thing possibly missing is an .html report; maybe that can be done in a follow-up.
The consolidated report is now over a thousand rows, which is hard to scan. render-report.js gains two visual outputs:
- A Charts section embedded in the markdown as Mermaid xychart bars, so a device throughput ranking and the KV-cache / quantization comparison for the fastest device render inline in the GitHub step summary.
- A --html output that writes a self-contained file (inline SVG, no deps or CDN) with the full per-device grouped charts and stddev error bars.

The summarize job emits both; the markdown points viewers to the HTML file (uploaded with the report artifacts) for the full per-device view.
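As a rough illustration of the inline chart output, the sketch below builds a Mermaid `xychart-beta` bar chart as a markdown string. Only the Mermaid syntax is taken as given; the helper name, its parameters and the point shape (`{ label, value }`) are assumptions, and the actual renderer may assemble its charts differently.

```javascript
// Hypothetical helper; only the Mermaid xychart-beta syntax itself is real.
function mermaidBarChart ({ title, yLabel, points }) {
  // Build the fence at runtime so this sample nests cleanly inside markdown.
  const fence = '`'.repeat(3)
  // Quote labels so device names containing spaces stay one category.
  const labels = points.map(p => JSON.stringify(p.label)).join(', ')
  const values = points.map(p => p.value).join(', ')
  // Round the axis maximum up to the next integer above the tallest bar.
  const yMax = Math.ceil(Math.max(...points.map(p => p.value)))
  return [
    fence + 'mermaid',
    'xychart-beta',
    `    title ${JSON.stringify(title)}`,
    `    x-axis [${labels}]`,
    `    y-axis ${JSON.stringify(yLabel)} 0 --> ${yMax}`,
    `    bar [${values}]`,
    fence
  ].join('\n')
}
```

The returned string can be appended to the step summary as-is. Error bars are not expressible in this chart type, which is consistent with the stddev whiskers living in the HTML output.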
Each mobile shard loads the model once per backend (gpu, cpu) and sweeps both reasoning-budget values on it. The warm-up was inside that reasoning-budget loop, so every backend warmed up twice. But the warm-up only primes the GPU kernels/caches for the loaded model, which the reasoning budget (a per-call generation param) does not change, so the second warm-up was pure overhead (~47s gpu / ~23s cpu per shard, discarded). Warm up once per backend; the three measured repetitions and their mean/stddev for TTFT, TPS and ppTPS are unchanged.
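The loop restructuring described above can be sketched as follows. The function and parameter names (`runShard`, `loadModel`, `warmUp`, `measure`) are hypothetical, and the sketch is kept synchronous for brevity; real model loading and inference calls would presumably be asynchronous.

```javascript
// Hypothetical shard runner: all names are illustrative, not the shard's real API.
function runShard ({ backends, budgets, reps, loadModel, warmUp, measure }) {
  const results = []
  for (const backend of backends) {
    const model = loadModel(backend) // one model load per backend
    // Warm up once per backend, outside the reasoning-budget loop:
    // the budget is a per-call generation param and does not change what gets primed.
    warmUp(model)
    for (const budget of budgets) {
      const samples = []
      for (let i = 0; i < reps; i++) samples.push(measure(model, budget))
      results.push({ backend, budget, samples })
    }
  }
  return results
}
```

With two backends, two budgets and three repetitions, this performs 2 warm-ups and 12 measured calls, where a warm-up inside the inner loop would have performed 4 warm-ups for the same 12 measurements.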
jesusmb1995
previously approved these changes
Jun 11, 2026
jpgaribotti
previously approved these changes
Jun 11, 2026
The Stamp desktop device step interpolated the nvidia-smi GPU name directly into the printf inside its run block. Route it through a GPU_NAME env var so the value reaches the shell as data rather than as expanded workflow syntax, matching the env-mapping already used for the dispatch inputs elsewhere in this workflow. Keeps the no-interpolation-into-run-blocks invariant uniform across every step.
maxim-smotrov
previously approved these changes
Jun 11, 2026
Contributor
Author
/review
The mobile chart helpers averaged a metric over every row sharing a (device, category) key, so a single bar blended both backends (gpu and cpu), both model sizes and both reasoning budgets — a value no real configuration produced — and its stddev whisker was the spread across those blended configs, not the measured 3-rep noise. Charts now hold every axis but the one on the x-axis at a fixed value (size 2B, reasoning budget -1, and the non-varied categorical at its default: weights Q4_K_M for KV-cache charts, KV f16 for the quantization chart), so each bar is one measured configuration and its whisker is that config's own 3-rep stddev. gpu and cpu are charted separately and never blended, with a shared y-scale per metric. The inline mermaid is reduced to one device-ranking chart at a single stated config. Crashed configs remain missing bars rather than zeros, and the download note now names the real artifact (qwen35-benchmark-findings) and the file inside it.
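One way to read the fix above: instead of averaging every row that shares a (device, category) key, select exactly the rows where all axes except the charted one sit at their fixed value. The sketch below assumes a row shape with `device`, `backend`, `size`, `reasoningBudget`, `weights`, `kv`, `mean`, `stddev` and `crashed` fields; those field names are hypothetical, while the fixed values come from the commit message.

```javascript
// Fixed values from the commit message; the field names are assumptions.
const FIXED = { size: '2B', reasoningBudget: -1, weights: 'Q4_K_M', kv: 'f16' }

// Returns one bar per measured configuration along the `vary` axis.
function barsFor (rows, { device, backend, vary }) {
  return rows
    // gpu and cpu are charted separately, never blended.
    .filter(r => r.device === device && r.backend === backend)
    // Every axis except the one on the x-axis must equal its fixed value.
    .filter(r => Object.keys(FIXED).every(k => k === vary || r[k] === FIXED[k]))
    // Crashed configs stay missing rather than becoming zero-height bars.
    .filter(r => !r.crashed)
    .map(r => ({ label: String(r[vary]), mean: r.mean, stddev: r.stddev }))
}
```

Because each bar maps to a single row, its whisker is that row's own repetition stddev rather than a spread across unrelated configurations.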
Coverage compared the reported shards against the renderer's CURRENT matrix, so re-rendering an older run after the matrix grew showed it as falsely incomplete: a complete 30-shard run read 30/70 against today's 70-shard matrix. The stamp-version job now records the run's expected shard list into run-meta.json alongside the addon version, and coverage scores against that stamped list when present, falling back to the live matrix only for runs that predate the stamp. A re-render of a stamped run is therefore always scored against the matrix it actually targeted, while genuinely missing shards are still surfaced.
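The scoring rule above can be sketched as below. The `expectedShards` field name inside `run-meta.json` and the function signature are assumptions for illustration; the commit message only states that the expected shard list is stamped alongside the addon version.

```javascript
// Hypothetical coverage scorer; `expectedShards` is an assumed field name.
function coverage ({ reportedShards, runMeta, liveMatrix }) {
  // Prefer the list stamped by the run itself; fall back to today's matrix
  // only for runs that predate the stamp.
  const stamped = runMeta && Array.isArray(runMeta.expectedShards)
    ? runMeta.expectedShards
    : null
  const expected = stamped || liveMatrix
  const reported = new Set(reportedShards)
  // Shards that were expected but never reported are still surfaced.
  const missing = expected.filter(shard => !reported.has(shard))
  return {
    done: expected.length - missing.length,
    total: expected.length,
    missing,
    source: stamped ? 'stamp' : 'live-matrix'
  }
}
```

For the case in the commit message, a run that reported all 30 of its stamped shards scores 30/30 even when the live matrix has grown to 70; the same run without a stamp would score 30/70.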
The report's chart note told readers to open qwen35-benchmark-charts.html but gave no link, so they had to scroll to the run's artifacts section and download it by hand. The renderer now takes an optional --charts-url and, when given, renders the artifact mention as a markdown link. The summarize job uploads the report first so the artifact's download URL is known, then substitutes that URL into the note before writing the run summary (falling back to the run page URL if the upload yields none). Local renders pass no URL and keep the plain text, so there is never a dangling link.
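A minimal sketch of the optional link follows. The artifact and file names are the ones quoted in this PR; the function name and the wording of the note are made up for the example.

```javascript
// Hypothetical note builder; the sentence wording is illustrative.
function chartsNote (chartsUrl) {
  const file = 'qwen35-benchmark-charts.html'
  const artifact = 'qwen35-benchmark-findings'
  // Render a markdown link only when a URL was supplied, so local renders
  // without --charts-url never contain a dangling link.
  const mention = chartsUrl ? `[${artifact}](${chartsUrl})` : artifact
  return `Full per-device charts: open ${file} from the ${mention} artifact.`
}
```

Calling `chartsNote()` with no argument yields plain text, and passing the artifact download URL (or the run page URL as the fallback) yields the linked form.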
…ebuilds The desktop sweep ran on the GitHub-hosted GPU runner and built the addon from source, using disk-cleanup hacks (docker prune, rm -rf /opt/...) meant for ephemeral runners — destructive on a shared persistent runner. Move it to the self-hosted qvac-ubuntu2204-x64-gpu runner the integration tests use, and download the linux-x64 binary the prebuild job already produces instead of compiling on the runner. This adds the Manual Workspace Cleanup self-hosted runners need and drops the source build, the destructive disk cleanup, and the LLVM/Vulkan/vcpkg setup. The prebuild job now also runs for desktop-only dispatches so the binary is available to download.
The summarize job fetched the report artifacts with actions/download-artifact, which verifies the artifact digest and was failing with `digest-mismatch` on otherwise-intact artifacts (the gh CLI downloads the same files without issue). Under continue-on-error that left the input directory silently empty, so the render step reported a misleading "no benchmark reports found" and exited. Switch the current-run and baseline downloads to `gh run download`, which pulls the artifacts by name prefix and run id without the digest check, and emits a warning rather than masking a real failure.
The summarize job downloads the report artifacts with `gh run download`, which calls the Actions artifacts API and needs actions:read. The job only granted contents:read, so the download returned nothing and the render step reported "no benchmark reports found". Add actions:read.
…h artifacts" This reverts commit 1bedeb5.
…mmarize" This reverts commit 1b7f6b3.
…-llm-suite # Conflicts: # .github/workflows/integration-mobile-test-llm-llamacpp.yml
prebuild now needs verify-shards, so a benchmark-shard matrix drift fails the run in ~30s instead of after the expensive prebuild. !cancelled() + the result check keep a desktop-only run (where verify-shards is skipped) working. The verify-shards comment is corrected to match.
🎯 What problem does this PR solve?
The WB team needs throughput numbers (TTFT, TPS, ppTPS) for Qwen3.5-0.8B and 2B across quantizations Q4_0, Q4_1, Q4_K_M, Q6_K, Q8_0 and reasoning-budget -1/0, on both desktop and mobile including KV-cache types on mobile — plus the ability to catch regressions between addon versions.
📝 How does it solve it?
Coverage
Report: unified renderer (`render-report.js`), one identical table per device (desktop + 5 mobile):
- `Crashed` rows for unsupported combos (e.g. quantized KV cache on Adreno GPUs, or TurboQuant/PolarQuant on iOS Metal and Samsung GPU: run anyway, detected, reported).

Cross-run comparison (regression detection)
- `summarize_only` re-renders a previous run's report in ~1 min, skipping the ~6h benchmarks.
- `compare_run_id` adds Δ TTFT / TPS / ppTPS columns vs a baseline run (downloads both runs' artifacts; no re-run needed). The baseline's version is read from its stamp, so the comparison is never mislabelled.

Mobile execution
- `test_groups` are generated from one source of truth (`test/integration/_benchmark-matrix.js`) and are not committed. CI regenerates them before the Device Farm bundle and hard-fails if any are missing or have drifted from the matrix, so the benchmark can never run against a stale or partial shard set.
- `test-groups.json`; scheduled only via the workflow's `test_groups` override.

Workflow inputs (no per-run configurability of the matrix; it's fixed in the scripts):
`ref`, `run_desktop`, `run_mobile`, `summarize_only`, `artifact_run_id`, `compare_run_id`. The shared `integration-mobile-test-llm-llamacpp.yml` gains two additive optional inputs (`job_timeout_minutes` default 120, `artifact_suffix` default empty), backward-compatible for other addon callers.

🧪 How was it tested?
- `npx standard` clean; `validate-mobile-tests.js` in sync; `verify:benchmark-shards` confirms the matrix, the generated `integration.auto.cjs` (shard-file refs and run-function names), and the workflow `test_groups` are all in lockstep, so a generator change can't silently desync the Device Farm grep.
- `test:integration:generate` regenerates everything with zero drift in the committed `integration.auto.cjs`; the mobile-only benchmark shards skip cleanly on desktop.
- Desktop (NVIDIA RTX 4000 SFF Ada Generation), desktop=5).
- mobile=3 with mean ± stddev and best-config per device. iPhone 16/17 report the full 70/70; the combos their GPUs don't support (Adreno quantized-KV, TurboQuant on Metal) surface as `Crashed` rows / coverage gaps, as intended.

💥 Known findings from the runs (data, not code issues)
- `gpu + kv=q4_0` and `gpu + kv=q8_0`: confirmed and reported as `Crashed`. CPU path handles quantized KV fine.
- ± stddev on those rows. This is genuine sustained-load throttling on real devices, not measurement error; the stddev reflects it honestly.

📦 Notes
- `index.js`/native or public-API change, so no version bump or CHANGELOG entry (`[skiplog]`).
- #2382 (workflow infra, already merged).
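For readers unfamiliar with the mean ± stddev and Δ columns mentioned in the description, the arithmetic can be sketched as below. Two points are assumptions rather than facts about `render-report.js`: that the stddev is the sample (n - 1) form, and that Δ is expressed as a percentage of the baseline rather than an absolute difference.

```javascript
// Mean and sample standard deviation over the measured repetitions.
// Assumption: sample (n - 1) stddev; the renderer may use the population form.
function summarize (samples) {
  const n = samples.length
  const mean = samples.reduce((sum, x) => sum + x, 0) / n
  const variance = n > 1
    ? samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1)
    : 0
  return { mean, stddev: Math.sqrt(variance) }
}

// Relative change vs the baseline run, in percent.
// Assumption: Δ is relative; returns null when no usable baseline value exists.
function deltaPercent (current, baseline) {
  if (!Number.isFinite(baseline) || baseline === 0) return null
  return ((current - baseline) / baseline) * 100
}
```

With only three repetitions per mobile configuration, a single throttled run moves the stddev noticeably, which fits the note above that wide ± values reflect sustained-load throttling rather than measurement error.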